INTERSPEECH 2022
Non-Intrusive Intelligibility Predictor for Hearing-Impaired Individuals using Self-Supervised Speech Representations
Close, George, Hain, Thomas, Goetze, Stefan
Self-supervised speech representations (SSSRs) have been successfully applied to a number of speech-processing tasks, e.g. as feature extractors for speech quality (SQ) prediction, which is, in turn, relevant for assessing and training speech-enhancement systems for users with normal or impaired hearing. However, why and how quality-related information is encoded well in such representations remains poorly understood. In this work, techniques for non-intrusive prediction of SQ ratings are extended to the prediction of intelligibility for hearing-impaired users. It is found that self-supervised representations are useful as input features to non-intrusive prediction models, achieving performance competitive with more complex systems. A detailed analysis of performance across Clarity Prediction Challenge 1 listeners and enhancement systems indicates that more data might be needed to allow generalisation to unknown systems and (hearing-impaired) individuals.
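As a rough illustration of the non-intrusive setup described in this abstract, the sketch below mean-pools frame-level representations into an utterance embedding and applies a linear prediction head. The feature dimensions, weights, and data are placeholders, not the authors' model; real systems would use embeddings from a pretrained self-supervised network.

```python
import numpy as np

def predict_intelligibility(frame_feats, w, b):
    """Non-intrusive prediction: mean-pool frame-level representations
    (shape T x D) into one utterance embedding, then apply a linear
    head. No clean reference signal is needed -- hence 'non-intrusive'."""
    utt_emb = frame_feats.mean(axis=0)   # (D,) utterance embedding
    return float(utt_emb @ w + b)        # scalar predicted rating

# Toy demo with random stand-in "SSSR" features (50 frames, 8 dims).
rng = np.random.default_rng(0)
feats = rng.standard_normal((50, 8))
w = rng.standard_normal(8)
score = predict_intelligibility(feats, w, b=0.5)
```

In practice the linear head would be trained on listener ratings; the pooling step is what lets a fixed-size predictor handle utterances of any length.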
Can ChatGPT Detect Intent? Evaluating Large Language Models for Spoken Language Understanding
Recently, large pretrained language models have demonstrated strong language understanding capabilities. This is particularly reflected in their zero-shot and in-context learning abilities on downstream tasks through prompting. To assess their impact on spoken language understanding (SLU), we evaluate several such models, like ChatGPT and OPT, of different sizes on multiple benchmarks. We verify the emergent ability unique to the largest models, as they can reach intent-classification accuracy close to that of supervised models with zero or few shots in various languages, given oracle transcripts. By contrast, the results for smaller models that fit on a single GPU fall far behind. We note that the error cases often arise from the annotation scheme of the dataset; responses from ChatGPT are still reasonable. We show, however, that the model is worse at slot filling, and its performance is sensitive to ASR errors, suggesting serious challenges for the application of such textual models to SLU.
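The zero-shot prompting setup this abstract evaluates can be sketched as below. The prompt template, intent labels, and label-matching logic are illustrative assumptions, not the paper's exact protocol; any chat-completion model could be plugged in where the response string comes from.

```python
# Hypothetical intent inventory; real benchmarks define their own labels.
INTENTS = ["play_music", "set_alarm", "get_weather"]

def build_prompt(utterance, intents=INTENTS):
    """Format a zero-shot prompt: list the allowed intents and ask the
    model to answer with exactly one label."""
    labels = ", ".join(intents)
    return (
        f"Classify the intent of the utterance below.\n"
        f"Answer with exactly one of: {labels}.\n\n"
        f"Utterance: {utterance}\nIntent:"
    )

def parse_intent(response, intents=INTENTS):
    """Map a free-form model response back to a known label;
    returns None when no label matches (e.g. annotation mismatches)."""
    text = response.strip().lower()
    for intent in intents:
        if intent in text:
            return intent
    return None

prompt = build_prompt("wake me up at 7 tomorrow")
```

A few-shot variant would simply prepend labelled example pairs to the same template; the parsing step matters because free-form responses do not always match the dataset's annotation scheme exactly.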
Defense Against Adversarial Attacks on Audio DeepFake Detection
Kawa, Piotr, Plata, Marcin, Syga, Piotr
Audio DeepFakes (DF) are artificially generated utterances created using deep learning, with the primary aim of fooling listeners in a highly convincing manner. Their quality is sufficient to pose a severe threat to security and privacy, including the reliability of news and exposure to defamation. Multiple neural-network-based methods for detecting generated speech have been proposed to counter these threats. In this work, we cover the topic of adversarial attacks, which decrease the performance of detectors by adding superficial (difficult for a human to spot) changes to input data. Our contribution consists of evaluating the robustness of three detection architectures against adversarial attacks in two scenarios (white-box and via transferability), and then enhancing it using adversarial training performed with our novel adaptive training method. Moreover, one of the investigated architectures is RawNet3, which, to the best of our knowledge, we are the first to adapt to DeepFake detection.
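A minimal sketch of the white-box attack idea, assuming the classic Fast Gradient Sign Method (FGSM) against a toy linear "detector" (the paper's detectors are deep networks; the logistic model and its analytic gradient here are stand-ins chosen so the example is self-contained):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_attack(x, y, w, eps):
    """FGSM on a toy logistic detector: perturb the input x by eps in
    the direction that increases the binary cross-entropy loss for the
    true label y (0 = bona fide, 1 = fake), degrading the detector."""
    p = sigmoid(w @ x)        # detector's predicted fake-probability
    grad_x = (p - y) * w      # analytic BCE gradient w.r.t. the input
    return x + eps * np.sign(grad_x)

rng = np.random.default_rng(1)
w = rng.standard_normal(16)
x = rng.standard_normal(16)   # stand-in for a frame of audio samples
x_adv = fgsm_attack(x, y=1.0, w=w, eps=0.01)
```

Adversarial training, as evaluated in the paper, amounts to generating such perturbed inputs during training and including them (with correct labels) in the detector's training set.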
A Study on the Reliability of Automatic Dysarthric Speech Assessments
Cadet, Xavier F., Aloufi, Ranya, Ahmadi-Abhari, Sara, Haddadi, Hamed
Automating dysarthria assessments offers the opportunity to develop effective, low-cost tools that address the current limitations of manual and subjective assessments. Nonetheless, it is unclear whether current approaches rely on dysarthria-related speech patterns or on external factors. We aim toward obtaining a clearer understanding of dysarthria patterns. To this end, we study the effects of noise in recordings, both through addition and reduction. We design and implement a new method for visualizing and comparing feature extractors and models, at a patient level, in a more interpretable way. We use the UA-Speech dataset with a speaker-based split. Results reported in the literature appear to have been obtained irrespective of such a split, leading to models that may be overconfident due to data leakage. We hope that these results raise awareness in the research community regarding the requirements for establishing reliable automatic dysarthria assessment systems.
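The noise-addition experiments mentioned above presuppose mixing noise into recordings at a controlled level. A common way to do this, sketched below under the assumption of a target signal-to-noise ratio in dB (the paper's exact mixing procedure is not specified here), is to rescale the noise before adding it:

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so the mixture speech + scaled-noise has the
    requested SNR (in dB) relative to `speech`, then add it."""
    p_speech = np.mean(speech ** 2)          # average speech power
    p_noise = np.mean(noise ** 2)            # average noise power
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)          # 1 s stand-in signal
noise = rng.standard_normal(16000)
mix = add_noise_at_snr(speech, noise, snr_db=10.0)
```

Sweeping `snr_db` then lets one check whether an assessment model's predictions track the dysarthric speech itself or merely the recording conditions.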
INTERSPEECH 2022 -- My First Conference Experience
The conference INTERSPEECH, one of the biggest conferences on the science and technology of spoken language processing, was held at Songdo ConvensiA in Incheon, South Korea, from Sep. 18 to 22, 2022. Integrating two previous conference series (EUROSPEECH and ICSLP), the first INTERSPEECH was held in 2000 in Beijing. Since then, INTERSPEECH has grown in popularity and held its 23rd event this year, 2022. Our company mainly focuses on building services that facilitate a better understanding of what events are taking place in an environmental sound scene. Despite a small discrepancy between this focus and the conference theme, since our company primarily concentrates on nonverbal audio signals rather than speech itself, many papers from INTERSPEECH have aided our research so far.